fix(esc): worker-thread iteration for OpenAI-compat streaming (LiteLLM hang fix) #148

Merged
ericleepi314 merged 1 commit into main from fix/esc-cancel-openai-compat-worker-queue
May 16, 2026
Conversation

@ericleepi314 (Collaborator)

Summary

Why this is needed

The TypeScript reference at typescript/src/services/api/openaiShim.ts works because of JavaScript's native fetch + AbortSignal integration: reader.read() rejects with AbortError when the signal trips. Python's sync httpx has no equivalent. Two earlier attempts confirmed the diagnosis:

  • signal.pthread_kill(main_thread, SIGINT) DID unblock the iteration, but the KeyboardInterrupt propagated uncaught past every layer of exception handling, exiting the entire python -m src.cli process. SIGINT lands at the next bytecode boundary, which is rarely inside our narrow except KeyboardInterrupt window.
  • response.close() from another thread is best-effort across SDK + proxy + httpx version combinations. For LiteLLM-proxied chunked-transfer streams, the kernel-level socket read keeps receiving data even after the client-side response object is "closed."

The worker+queue pattern is the only portable design that makes the main thread's response time independent of the SDK's cooperation.
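A minimal sketch of the pattern, assuming a simple `threading.Event` abort flag (hypothetical names; the real implementation lives in src/providers/openai_compatible.py and raises the project's AbortError via the guard, not a bare RuntimeError):

```python
import queue
import threading

_DONE = object()  # sentinel: worker finished, normally or with an error


def stream_with_abort(sdk_stream, aborted, tick=0.1):
    """Yield chunks from sdk_stream, re-checking `aborted` every `tick` seconds."""
    q = queue.Queue()

    def worker():
        try:
            for chunk in sdk_stream:
                q.put(chunk)
        except BaseException as exc:  # route KeyboardInterrupt/SystemExit to main
            q.put(exc)
        finally:
            q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        if aborted.is_set():
            raise RuntimeError("aborted")  # stand-in for the real AbortError
        try:
            item = q.get(timeout=tick)
        except queue.Empty:
            continue  # bounded wait: loop around and re-check the abort flag
        if item is _DONE:
            return
        if isinstance(item, BaseException):
            raise item  # re-raise the worker's exception on the main thread
        yield item
```

On abort the worker is simply orphaned: as a daemon thread it dies when the process exits or the upstream connection finally closes.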

Changes

  • src/providers/openai_compatible.py:

    • SDK iteration moved to a daemon worker thread
    • Main thread polls queue.Queue with 100ms timeout, checks guard.aborted between ticks
    • Worker catches BaseException so KeyboardInterrupt/SystemExit are routed back to main for re-raise
    • Abort sites use guard.raise_if_post_aborted() instead of hardcoded AbortError("user_interrupt") to preserve non-default abort reasons
    • Method docstring rewritten to explain the worker indirection (preventing future "simplifications")
  • tests/test_openai_compat_abort_signal.py — 2 new regression tests:

    • test_abort_unwinds_promptly_even_when_iterator_never_returns — uses a _StuckStream whose __iter__ blocks on a never-set threading.Event. Mutation-verified: reverting to pre-fix code made pytest hang past an 8-second gtimeout — confirming this test catches the LiteLLM regression.
    • test_normal_completion_still_captures_final_usage — pins the drain invariant. OpenAI emits usage stats in the final empty-choices chunk; the main thread must process every queued chunk before breaking on _DONE, otherwise the REPL's "↓ N tokens" spinner would silently lose count.
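A hedged reconstruction of what the stuck-stream test exercises (the real test lives in tests/test_openai_compat_abort_signal.py; `_StuckStream` and `drain` here are illustrative, not the actual code):

```python
import queue
import threading
import time


class _StuckStream:
    """Simulates the LiteLLM hang: iteration blocks on a never-set Event."""

    def __iter__(self):
        threading.Event().wait()  # never set, so this blocks forever
        yield  # never reached


def drain(stream, aborted, tick=0.1):
    """Minimal worker+queue drain; only models the abort path (no _DONE sentinel)."""
    q = queue.Queue()
    threading.Thread(
        target=lambda: [q.put(c) for c in stream], daemon=True
    ).start()
    while True:
        if aborted.is_set():
            raise TimeoutError("aborted")  # stand-in for the real AbortError
        try:
            q.get(timeout=tick)
        except queue.Empty:
            continue  # re-check the abort flag on the next tick


def test_abort_unwinds_promptly_even_when_iterator_never_returns():
    aborted = threading.Event()
    threading.Timer(0.2, aborted.set).start()  # simulate ESC after 200 ms
    start = time.monotonic()
    try:
        drain(_StuckStream(), aborted)
    except TimeoutError:
        pass
    assert time.monotonic() - start < 2.0  # unwound promptly, did not hang
```

Without the worker indirection, `drain` would block inside `next(stream)` on the main thread and the timing assertion could never run.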

Test plan

  • 2 new regression tests pass; stuck-stream test is mutation-verified
  • 34 abort-related tests pass (8 OpenAI-compat + 17 StreamAbortGuard + 5 Anthropic + 4 Minimax)
  • 4438 broader related tests pass
  • Critic subagent review: APPROVE (after applying docstring + abort-reason cleanups)
  • User manually verified on their LiteLLM → Claude Opus 4.7 stack — ESC now works promptly

Trade-offs

  • Wasted bandwidth on orphaned reads: the worker keeps consuming bytes until the upstream connection closes naturally. Practical impact: at most one cancelled stream of bandwidth per ESC press, reclaimed when the upstream finishes. Memory: queue.Queue is unbounded, so a worst-case "stuck iterator AND chunks flowing" path could accumulate megabytes. Flagged as a follow-up; not blocking on current evidence (the user's repro has the iterator stuck, not flowing).
  • 100 ms response cadence: ESC takes at most one tick to unwind. Imperceptible to users.

Out of scope (explicit user request)

Not extending the worker+queue to AnthropicProvider or MinimaxProvider. Both use the anthropic SDK and work reliably against direct connections; no user has reported proxy-buffering issues against them. The same pattern ports cleanly if needed in the future.

🤖 Generated with Claude Code

PRs #145 / #146 added mid-stream ESC cancellation for OpenAI-compatible
providers via a ``stream.response.close()`` listener. That works for
direct connections, but a user on LiteLLM-proxied Claude Opus 4.7
reported it still hangs indefinitely — ``response.close()`` from the
listener thread doesn't actually terminate the SDK's blocking socket
read in that configuration, so the iterator keeps draining bytes and
the agent loop's cancel boundary never sees the abort.

Two earlier attempts confirmed the diagnosis:

* Adding ``signal.pthread_kill(main_thread, SIGINT)`` to the listener
  DID unblock the iteration but propagated ``KeyboardInterrupt``
  uncaught past every layer of exception handling, exiting the entire
  ``python -m src.cli`` process. The signal hits the main thread at
  the next bytecode boundary, which is rarely inside our narrow
  ``except KeyboardInterrupt`` window.
* The TypeScript reference at
  ``typescript/src/services/api/openaiShim.ts`` works because
  JavaScript's native ``fetch + AbortSignal`` integration makes
  ``reader.read()`` reject with ``AbortError`` when the signal trips.
  Python's sync ``httpx`` has no equivalent.

Fix: decouple the main thread's response time from the SDK's
cooperation. The SDK iteration runs on a daemon worker thread that
pushes chunks into a ``queue.Queue``. The main thread polls with a
100 ms timeout and re-checks ``guard.aborted`` between ticks. On
abort, raise ``AbortError`` immediately and orphan the worker — it
dies when the underlying connection eventually closes. The cost is
some wasted bandwidth on the orphaned read; the benefit is that ESC
unwinds within ~100 ms regardless of LiteLLM / httpx / chunked-
transfer behavior.

Specifics:

* Worker pushes each chunk into the queue, then a ``_DONE`` sentinel
  in a ``finally`` so the main thread can break the loop on either
  normal exhaustion or worker exception.
* Worker catches ``BaseException`` so ``KeyboardInterrupt`` /
  ``SystemExit`` from the worker thread are routed back to the
  main thread for re-raise rather than silently dropped.
* Main-thread loop checks abort in the ``Queue.Empty`` branch (bounds
  ESC latency at one tick) AND after processing each chunk (preserves
  the chunk we just received, matching the in-loop-check semantics
  pinned by the existing chunk-list regression test).
* Abort sites call ``guard.raise_if_post_aborted()`` instead of
  hardcoding ``AbortError("user_interrupt")`` so a non-default abort
  reason on the signal (future ``"rate_limit_backoff"`` etc.) is
  preserved.
* Updated method docstring explains the worker indirection and the
  Python-vs-JS-fetch contrast so the next engineer touching this
  doesn't "simplify" the queue away.
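The abort-reason preservation can be sketched with a hypothetical guard (the real ``StreamAbortGuard`` interface is not shown in this PR; the fields and reason strings below are reconstructed from the description):

```python
import threading


class AbortError(Exception):
    """Carries the abort reason instead of a hardcoded 'user_interrupt'."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class StreamAbortGuard:
    """Hypothetical guard: records a reason at abort time, raises it later."""

    def __init__(self):
        self._reason = None
        self._lock = threading.Lock()

    def abort(self, reason="user_interrupt"):
        with self._lock:
            self._reason = reason

    @property
    def aborted(self):
        return self._reason is not None

    def raise_if_post_aborted(self):
        # Re-raise with whatever reason was set, so a future
        # "rate_limit_backoff" abort is not masked as "user_interrupt".
        if self._reason is not None:
            raise AbortError(self._reason)
```

This is why the abort sites call the guard rather than constructing the exception inline: the reason travels with the signal, not with the call site.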

Two new regression tests:

* ``test_abort_unwinds_promptly_even_when_iterator_never_returns``
  uses a ``_StuckStream`` whose ``__iter__`` blocks on a never-set
  Event. Without the worker+queue, the main thread would block on
  ``next(stream)`` forever; mutation-verified by reverting to the
  pre-fix code and watching pytest hang past an 8-second gtimeout.
* ``test_normal_completion_still_captures_final_usage`` pins the
  drain invariant: the main thread must process the final
  empty-choices/populated-usage chunk before breaking on ``_DONE``,
  otherwise OpenAI's ``stream_options.include_usage=True`` token
  counts would silently regress.
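The drain invariant can be illustrated in miniature (queue contents and chunk shapes below are illustrative; the real loop also polls with a timeout and checks the abort guard between ticks):

```python
import queue

_DONE = object()  # sentinel the worker enqueues after the stream ends


def collect_usage(q, tick=0.1):
    """Drain every queued chunk; stop only when _DONE itself is dequeued."""
    usage = None
    while True:
        item = q.get(timeout=tick)
        if item is _DONE:
            return usage
        if item.get("usage"):  # final empty-choices chunk carries the totals
            usage = item["usage"]


# The final usage chunk is queued BEFORE _DONE, so breaking on the sentinel
# alone (rather than on an empty queue) guarantees it is processed.
q = queue.Queue()
q.put({"choices": [{"delta": {"content": "hi"}}], "usage": None})
q.put({"choices": [], "usage": {"total_tokens": 7}})
q.put(_DONE)
final_usage = collect_usage(q)
```

A loop that broke as soon as the queue looked empty, or as soon as choices were empty, would drop that last chunk and regress the token counts.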

Anthropic and Minimax providers are unchanged. They use the
``anthropic`` SDK whose ``client.messages.stream(...)`` works reliably
against direct connections, and no user has reported the LiteLLM-style
proxy buffering against them. If that changes, the same worker+queue
pattern ports cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>